Declarative XML Data Cleaning with XClean

Authors

  • Melanie Herschel
  • Ioana Manolescu
Abstract

Data cleaning is the process of correcting anomalies in a data source, which may be due, for instance, to typographical errors or to duplicate representations of an entity. It is a crucial task in customer relationship management, data mining, and data integration. With the growing amount of XML data, approaches that clean XML effectively and efficiently are needed, an issue not addressed by existing data cleaning systems, which mostly specialize in relational data. We present XClean, a data cleaning framework specifically geared towards cleaning XML data. XClean’s approach is based on a set of cleaning operators whose semantics are well defined in terms of XML algebraic operators. Users may specify cleaning programs by combining operators in a declarative XClean/PL program, which is then compiled into XQuery. We describe XClean’s operators, language, and compilation approach, and validate its effectiveness through a series of case studies.

Related resources

XClean in Action: A Demonstration of Declarative XML Data Cleaning

We demonstrate XClean, a data cleaning system specifically geared towards cleaning XML data. XClean’s approach is based on a set of cleaning operators. Users may specify cleaning programs by combining operators using the declarative XClean/PL language, which is then compiled into XQuery. We plan to show XClean in action on several scenarios based on real-world data. A graphical user interface s...

XClean in Action (Demo)

We demonstrate XClean, a data cleaning system specifically geared towards cleaning XML data. XClean’s approach is based on a set of cleaning operators. Users may specify cleaning programs by combining operators using the declarative XClean/PL language, which is then compiled into XQuery. We plan to show XClean in action on several scenarios based on real-world data. A graphical user interface s...

Duplicate detection in XML data

Duplicate detection consists of detecting multiple representations of the same real-world object, for every object represented in a data source. Duplicate detection is relevant in data cleaning and data integration applications and has been studied extensively for relational data describing a single type of object in a single table. Our research focuses on iterative duplicate detection i...

ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments

Extraction-Transformation-Loading (ETL) and Data Cleaning tools are pieces of software responsible for the extraction of data from several sources, their cleaning, customization and insertion into a data warehouse. To deal with the complexity and efficiency of the transformation and cleaning tasks we have developed a tool, namely ARKTOS, capable of modeling and executing practical scenarios, by...

On Data Cleaning In Building XML Data Warehouses

One of the most important aspects of building an XML data warehouse is the data cleaning and integration process. This paper presents a detailed methodology for cleaning and integrating data, especially useful for general situations in which documents from different sources are involved. Both situations whereby the XML documents have an associated XML Schema or they are just independent XML documents are con...

Publication date: 2007